Abstract

Many cognitive processes rely on the ability of the brain to hold sequences of events
in short-term memory. Recent studies have revealed that such memory can be read
out from the transient dynamics of a network of neurons. However, the memory
performance of such a network in buffering past information has only been rigorously
estimated in networks of linear neurons. When signal gain is kept low, so that neurons
operate primarily in the linear part of their response nonlinearity, the memory lifetime
is bounded by the square root of the network size. In this work, I demonstrate that
it is possible to achieve a memory lifetime almost proportional to the network size,
"an extensive memory lifetime", when the nonlinearity of neurons is appropriately
utilized. The analysis of neural activity reveals that nonlinear dynamics prevent the
accumulation of noise by partially removing noise in each time step. With this
error-correcting mechanism, I demonstrate that a memory lifetime of order N/log N can be
achieved.
1 Introduction

Buffering a sequence of events in the activity of neurons is an important property of the
brain that is necessary to carry out many cognitive tasks (Baddeley, 2000; Baeg et al., 2003;
de Fockert et al., 2001; Hahnloser et al., 2002; Münte et al., 1998; Orlov et al., 2000;
Pastalkova et al., 2008). The fundamental limit of the capacity of the sequential memory
is, however, largely unknown. Several works have suggested that a long memory lifetime
can arise as a network property of neurons, where individual neurons typically have
limited memory (Goldman, 2009; Lim and Goldman, 2011; White et al., 2004). However, the
structures and operating regimes suitable for a network of neurons to buffer a sequence of
events are also unknown. This paper investigates the limit of such sequential memory for
buffering past stimuli in the presence of dynamical noise. More specifically, we examine how
reconstructions of past stimuli degrade as we trace them back into the past. This kind of
working memory generally improves with the size of the network. Hence, important questions
are how the memory lifetime scales with network size and what kind of network structure
achieves the longest memory lifetime. The scaling of the memory lifetime with the network size
has been rigorously characterized only under limited conditions (Ganguli et al., 2008; White
et al., 2004). In particular, the memory lifetime for non-saturating linear neurons can be
proportional to the network size, N, which is, from an information-theoretical perspective,
the best possible situation for reconstructing all sequences of non-sparse input (Ganguli and
Sompolinsky, 2010). This is called the extensive memory lifetime. Ganguli et al. also
estimated the memory lifetime of a network of neurons with response nonlinearity but under a
rather restricted condition where the signal gain was kept small so that neurons operated in
their linear regime. Under this condition, the sequential memory lifetime is upper-bounded
by ~ $\sqrt{N}$ (Ganguli et al., 2008). However, as we will see in the following, fine-tuning of a
network parameter is necessary for this to work.

I explore in this paper a network structure that yields a sequential memory lifetime
longer than the bound previously set for nonlinear neurons. The network structure
that I explore is a simple feedforward network with a fixed number of neurons in each layer.
This network architecture has been studied in the context of synchronous-firing chains, i.e.,
synfire chains (Abeles, 1982; Bienenstock, 1995; Diesmann et al., 1999; Herrmann et al., 1995;
Kumar et al., 2008; Rossum et al., 2002; Vogels and Abbott, 2005). In the current
context, reliable propagation of synfire activity is used to maintain information on past
sequences. Although reliable propagation of synfire activity in the presence of noise has
been reported several times, quantitative characterization of such reliability has been only
partially achieved. In particular, previous studies did not systematically evaluate the effect
of occasional strong noise that spontaneously ignites or blocks synfire activity (Bienenstock,
1995; Herrmann et al., 1995). As we will see, this occasional large noise prevents a network
from achieving an extensive memory lifetime. The scaling of the memory lifetime with the
network size in the presence of such noise has not yet been reported to my knowledge.

In this paper, I analytically evaluate the effects of response nonlinearity and noise on
the performance of sequential memory. The main result is the following: If we require a
network of N neurons to hold I bits of information about the stimulus presented at each time,
the achievable memory lifetime is proportional to $(N/I)/\log(N/I)$, which is much longer
than the previously proposed order $\sqrt{N}$ memory, assuming a small gain. Moreover, the
nonlinear dynamics of neurons drastically improves the tolerance of working memory to noise
levels, compared to the previously proposed semi-linear dynamical regime. Numerical
simulations show that complex firing sequences of leaky integrate-and-fire neurons are successfully
buffered by this network architecture.



2 Result 

In order to derive the memory lifetime of feedforward networks, I consider a simple firing-rate 
model of neurons with saturating response nonlinearity. We first aim at reconstructing a 
sequence of binary input but we will show later that it is straightforward to generalize this 
scheme to reconstruct sequences of analog input. 

2.1 Evaluating the memory lifetime for interacting nonlinear neurons

To study what effects response nonlinearity has on the memory lifetime in a simple system,
we consider the dynamics of a homogeneous feedforward network of L layers (see Fig. 1),
where each layer has n neurons, and the total number of neurons in the network is given by
N = nL. Let us consider discrete-time dynamics here for the sake of simplicity. The activity
of neuron i in layer $l + 1$ at time $t + 1$ is modeled as

$$r_i(t+1) = \phi\Big(\sum_{j \in S_l} w_{ij}\, r_j(t) + \sigma \xi_i(t+1)\Big), \tag{1}$$

where $\phi$ is the response nonlinearity, $S_l$ is the set of n neurons in layer $l$, $w_{ij} = 1/n$ is
the uniform synaptic strength from neuron j in layer $l$ to neuron i in layer $l + 1$, $\sigma$ is the
magnitude of noise, and $\xi_i$ is an independent white Gaussian random variable of unit variance
that describes the postsynaptic noise to neuron i. Here, in order to better distinguish the
effect of the nonlinearity from the signal-to-noise ratio, we fix the magnitude of synaptic
strength and instead change the slope of the nonlinearity^1 and the parameter $\sigma$ that controls
the noise level relative to the input from the previous layer. The time-dependent input to
the network is described by $r_0(t)$ and fed only to the first layer. For simplicity, we assume a
sequence of binary input that takes either a positive or negative value of the same magnitude.
Because the input to the network at each time step propagates separately in this feedforward
model, we drop the time index, t, in the following and focus on the propagation of a certain
binary input signal, $r_0$. The information about input degrades as the activity travels down
the chain due to noise. The task is to find the maximum number of layers L until which the
information about the binary input reliably propagates.
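The propagation described by Eq. 1 can be illustrated with a minimal simulation sketch. It assumes the nonlinearity $\phi(x) = \tanh(\beta x)$; the function name `propagate`, the random seed, and the parameter values (n = 50, 100 layers, $\sigma$ = 0.3) are illustrative assumptions rather than settings taken from the analysis.

```python
import numpy as np

def propagate(r0, n, n_layers, beta, sigma, rng):
    """Layer-averaged activity along the chain for a single input r0."""
    r_bar = float(r0)                                # input fed to the first layer
    averages = []
    for _ in range(n_layers):
        xi = rng.standard_normal(n)                  # unit-variance noise, one per neuron
        r = np.tanh(beta * (r_bar + sigma * xi))     # Eq. 1 with uniform weights w_ij = 1/n
        r_bar = float(r.mean())                      # average activity of this layer
        averages.append(r_bar)
    return np.array(averages)

rng = np.random.default_rng(0)                       # hypothetical seed for reproducibility
low_gain = propagate(1.0, n=50, n_layers=100, beta=0.8, sigma=0.3, rng=rng)
high_gain = propagate(1.0, n=50, n_layers=100, beta=2.0, sigma=0.3, rng=rng)
print("final average activity, beta = 0.8:", low_gain[-1])
print("final average activity, beta = 2.0:", high_gain[-1])
```

With a shallow slope the average activity is expected to lose the input within a few layers, whereas with a steep slope it should stay near a nonzero value; the sketch only illustrates this qualitative contrast.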

Let $\bar{r}_l = \frac{1}{n}\sum_{i \in S_l} r_i$ be the average activity of the neurons in layer $l$. Because of the
uniform coupling strengths, $w_{ij} = 1/n$, between adjacent layers, the average activity of layer
$l + 1$ is given in terms of the average activity of layer $l$ by

$$\bar{r}_{l+1} = \frac{1}{n}\sum_{i \in S_{l+1}} \phi(\bar{r}_l + \sigma \xi_i). \tag{2}$$

Note that Eq. 2 is derived by averaging both sides of Eq. 1 with $i \in S_{l+1}$. Here, $\bar{r}_{l+1}$ is
the sum of n independent and identically distributed random variables. Hence, by using
the central limit theorem, the conditional distribution, $P(\bar{r}_{l+1}|\bar{r}_l)$, approaches a Gaussian
distribution

$$P(\bar{r}_{l+1}|\bar{r}_l) \approx \mathcal{N}\Big(\mu(\bar{r}_l), \frac{v(\bar{r}_l)}{n}\Big) \tag{3}$$

for a large n. This distribution is characterized by the conditional mean, $\mu(\bar{r}_l)$, and the
conditional variance, $v(\bar{r}_l)/n$, calculated as:

$$\mu(\bar{r}) = \int \phi(\bar{r} + \sigma\xi)\, D\xi, \qquad
v(\bar{r}) = \int \phi^2(\bar{r} + \sigma\xi)\, D\xi - \mu^2(\bar{r}), \tag{4}$$

where $D\xi = \frac{e^{-\xi^2/2}}{\sqrt{2\pi}}\, d\xi$ describes a Gaussian integral. These two quantities $\mu$ and $v$ are plotted
in Fig. 2 for $\phi(x) = \tanh(\beta x)$. With this conditional probability, and for a given input $r_0$,
the probability distribution of the average activity in the final layer can be formally described
as $P(\bar{r}_L|r_0) = \int \cdots \int P(\bar{r}_L|\bar{r}_{L-1}) P(\bar{r}_{L-1}|\bar{r}_{L-2}) \cdots P(\bar{r}_1|r_0)\, d\bar{r}_{L-1}\, d\bar{r}_{L-2} \cdots d\bar{r}_1$, which is sufficient
to characterize the memory degradation at layer L.

Figure 1: Simple feedforward network model of size N. There are n neurons in each layer,
and there are L = N/n successive layers. Each neuron is connected to all the neurons in the
previous layer with a uniform synaptic strength. We study how the input of strength, $r_0$,
propagates down the feedforward chain.

^1 We change the slope of $\phi$ by introducing a scaling parameter $\beta$, with which the nonlinearity is written as
$\phi(x) = \psi(\beta x)$ for some fixed nonlinearity $\psi$.

Figure 2: Conditional probability $P(\bar{r}_{l+1}|\bar{r}_l)$ of the average activity. The blue line is the
conditional mean, $\mu(\bar{r}_l)$, and the pink band is the conditional standard deviation, $\sqrt{v(\bar{r}_l)/n}$.
The brown line indicates the condition $\bar{r}_{l+1} = \bar{r}_l$, and the black dashed line shows the
nonlinear response function $\phi(x) = \tanh(\beta x)$. (A) The slope of $\phi$ is $\beta = 0.8$. Here, $\bar{r} = \mu(\bar{r})$
has only one attracting solution at $\bar{r} = 0$. Hence, the activity tends to decay toward 0.
(B) The slope of $\phi$ is $\beta = 2$. Here, $\bar{r}_{l+1} = \mu(\bar{r}_l)$ has three fixed points: two ($\bar{r} \approx \pm 0.9$)
are attractive and one ($\bar{r} = 0$) is repulsive. When $\bar{r}$ is close to one of the attracting fixed
points, noise does not accumulate because it is partially removed at each time step. Other
parameters are set to $\sigma = 0.3$ and n = 10.

In the following, let us consider a class of odd saturating nonlinearities^2 such as $\phi(x) =
\tanh(\beta x)$. For this class of functions, we can show that the conditional mean, $\mu$, is also odd
and saturating. This means that the slope of $\mu$ is steepest at the origin ($\mu'(x) < \mu'(0)$ for
$x \neq 0$). Let us first consider a trivial case, where the conditional variance, $v(\bar{r})/n$, is small
and negligible. In this case, the dynamics is well approximated by a deterministic update
equation of the mean activity, $\bar{r}_{l+1} = \mu(\bar{r}_l)$. Because $\mu(0) = 0$ for odd $\phi$, $\bar{r} = 0$ is always a
fixed point in this dynamics. Let us call the slope $\mu'(0)$ the gain. If the gain is small ($\mu'(0) < 1$),
$\bar{r} = 0$ is a unique and stable fixed point because $\mu$ is saturating. The average activity must
decay toward 0 (Fig. 2A). On the other hand, if the gain is large ($\mu'(0) > 1$), $\bar{r} = 0$ becomes
an unstable fixed point, and two stable fixed points (one positive and one negative) appear
(Fig. 2B). Hence, the average activity converges to either the positive or the negative fixed
point depending on the sign of the input $r_0$. In general, when the conditional variance
$v(\bar{r})/n$ is not negligible, the activity fluctuates around the deterministic dynamics described
above, but the trend is similar. One important property is that, while the conditional mean,
$\mu(\bar{r})$, is independent of the number of neurons per layer, n, the conditional variance $v(\bar{r})/n$
decreases as n increases. Hence, when the gain is small ($\mu'(0) < 1$), increasing the number of
neurons in each layer does not prevent the average activity from decaying. This means that
the memory lifetime is order 1, i.e., the memory lifetime does not scale with the number of
neurons in each layer and is determined by the gain. In contrast to the above case, when the
gain is large ($\mu'(0) > 1$) and the activity approaches one of the attracting fixed points, the
memory degrades due to the conditional variance, $v(\bar{r})/n$, and, hence, due to finite n. This
memory decay can be slowed down by increasing the number of neurons in each layer. Even
with additional noise at each time step, the attracting force toward one of the stable fixed
points can partially remove this noise. This error-correcting dynamics that prevents noise
from accumulating becomes essential for the nearly extensive memory lifetime as we will see
in what follows.

^2 More precisely, we consider a class of functions that satisfy: $\phi(x) = -\phi(-x)$ (odd), $|\phi(x)| \le 1$ (bounded),
$\phi'(|x|) \ge 0$ (increasing), and $\phi''(|x|) \le 0$ (saturating).
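The conditional mean and variance of Eq. 4, the gain $\mu'(0)$, and the fixed points of the deterministic update can be evaluated numerically. The sketch below assumes $\phi(x) = \tanh(\beta x)$ and uses Gauss-Hermite quadrature for the Gaussian integrals; the quadrature order, the helper names, and the iteration count are illustrative assumptions, not part of the analysis.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Nodes and weights for integrals against exp(-xi^2 / 2); normalizing the
# weights by sqrt(2 pi) turns them into the Gaussian measure D(xi) of Eq. 4.
_XI, _W = hermegauss(80)
_W = _W / np.sqrt(2.0 * np.pi)

def cond_mean(r_bar, beta, sigma):
    """mu(r_bar) of Eq. 4 for phi(x) = tanh(beta x)."""
    return float(np.sum(_W * np.tanh(beta * (r_bar + sigma * _XI))))

def cond_var(r_bar, beta, sigma):
    """v(r_bar) of Eq. 4 for phi(x) = tanh(beta x)."""
    phi = np.tanh(beta * (r_bar + sigma * _XI))
    return float(np.sum(_W * phi ** 2) - np.sum(_W * phi) ** 2)

def gain(beta, sigma):
    """mu'(0) = beta * integral of sech^2(beta sigma xi) D(xi)."""
    return float(beta * np.sum(_W / np.cosh(beta * sigma * _XI) ** 2))

def iterate_mean(r_init, beta, sigma, n_iter=500):
    """Iterate the noise-free update r <- mu(r) starting from r_init."""
    r = r_init
    for _ in range(n_iter):
        r = cond_mean(r, beta, sigma)
    return r

for beta in (0.8, 2.0):
    print("beta =", beta,
          "gain =", round(gain(beta, 0.3), 3),
          "limit of r <- mu(r) from 0.5:", round(iterate_mean(0.5, beta, 0.3), 3))
```

For the two slopes used in Fig. 2 ($\sigma = 0.3$), this sketch should give a gain below 1 with decay to zero for $\beta = 0.8$, and a gain above 1 with convergence to a positive fixed point for $\beta = 2$.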

Let us give an intuitive overview of how the ~ $N/\log N$ memory lifetime is derived.
According to the attractor dynamics described above, the average activity near an attracting
fixed point can only be driven closer to the other attracting fixed point when uncommonly
large noise occurs. We estimate how often this rare flipping occurs. The central limit theorem
of Eq. 3 states that the effective noise variance is inversely proportional to the number of neurons
in each layer, n. Kramers' escape rate (e.g., Risken, 1996) yields that the probability of the
activity flipping from one attracting basin to the other in a particular layer is approximately
$e^{-n}$, where constant factors and higher order terms are neglected. Hence, the probability
that no flipping occurs throughout L = N/n successive layers, in which case the input is
correctly estimated from the activity of the final layer, is about
$(1 - e^{-n})^{L} \approx \exp(-(N/n)e^{-n})$.
It is easy to see that in order to keep this probability finite in the limit of large N, n should
increase asymptotically faster than or equal to $\log N$. Therefore, the best achievable memory
lifetime is $L \sim N/\log N$.
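The intuitive estimate above can be checked numerically. The sketch below takes the approximate no-flip probability $\exp(-(N/n)e^{-n})$ literally (constant factors neglected, as in the text) and searches for the smallest layer size n that keeps it above a target value; the target of 0.9 and the function names are illustrative assumptions.

```python
import math

def survival_probability(N, n):
    """exp(-(N/n) e^{-n}): approximate chance of no flip across L = N/n layers."""
    return math.exp(-(N / n) * math.exp(-n))

def smallest_layer_size(N, target=0.9):
    """Smallest n whose approximate survival probability reaches the target."""
    for n in range(1, N + 1):
        if survival_probability(N, n) >= target:
            return n
    raise ValueError("no layer size reaches the target")

for N in (10**3, 10**4, 10**5, 10**6):
    n = smallest_layer_size(N)
    # columns: N, required n, log N for comparison, resulting chain length N // n
    print(N, n, round(math.log(N), 1), N // n)
```

Under this crude estimate the required layer size grows roughly like $\log N$, so the chain length N/n grows almost linearly with N.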

Let us more rigorously evaluate a lower bound for the memory lifetime when the dynamics
is sufficiently nonlinear ($\mu'(0) > 1$). In this case, we can choose a positive constant $r_c$ which
satisfies $\mu(r_c) > r_c$ (c.f. Fig. 2B). The basic strategy is to assure with high probability that
the average activity in the final layer is $\bar{r}_L > r_c$ if the input is also $r_0 > r_c$. Because of the
symmetry of the system, this condition also guarantees that the activity is $\bar{r}_L < -r_c$ if the
input is $r_0 < -r_c$. Using the Gaussian assumptions of $P(\bar{r}_{l+1}|\bar{r}_l)$ for given $\bar{r}_l$ (c.f. Eq. 3),
the probability of $\bar{r}_{l+1} > r_c$ is expressed in terms of the error function by

$$P(\bar{r}_{l+1} > r_c|\bar{r}_l) = \int_{r_c}^{\infty} d\bar{r}_{l+1}\, P(\bar{r}_{l+1}|\bar{r}_l)
= 1 - \frac{1}{2}\,\mathrm{erfc}\Big(\sqrt{\frac{n}{2}}\,\frac{\mu(\bar{r}_l) - r_c}{\sqrt{v(\bar{r}_l)}}\Big). \tag{5}$$

Therefore, for any $\bar{r}_l > r_c$, the probability of Eq. 5 is lower bounded by

$$P(\bar{r}_{l+1} > r_c|\bar{r}_l) \ge 1 - \frac{1}{2}\,\mathrm{erfc}\Big(\sqrt{\frac{n}{2}}\, z_c\Big), \tag{6}$$

where

$$z_c \equiv \min_{\bar{r} \ge r_c} \frac{\mu(\bar{r}) - r_c}{\sqrt{v(\bar{r})}} \ge \frac{\mu(r_c) - r_c}{\sqrt{1 - \mu^2(r_c)}} > 0 \tag{7}$$

is a positive constant because $\mu(r_c) > r_c$. To obtain the first inequality in Eq. 7 we used two
properties: the variance of Eq. 4 is upper bounded by $v(\bar{r}) \le 1 - \mu^2(\bar{r})$ if $|\phi(x)| \le 1$, and $\mu$ is
monotonically increasing with monotonically increasing $\phi$ (because $\mu'(\bar{r}) = \int \phi'(\bar{r} + \sigma\xi)\, D\xi \ge
0$). The right hand side of Eq. 6 can take a value close to 1 for large n ($\gg 2/z_c^2$), suggesting
that the average activity tends to remain in the same interval ($\bar{r} > r_c$) as the previous
layer with high probability. If we assume that the input to the first layer is $r_0 > r_c$, the
probability that the average activity will reliably propagate through all layers without ever
escaping below $r_c$ is

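$$\begin{aligned}
P_c &= \Pr\big(\{\bar{r}_l > r_c\}_{l=1}^{L} \,\big|\, r_0\big) \\
&= \int_{r_c}^{\infty} \prod_{l=1}^{L} \big[P(\bar{r}_l|\bar{r}_{l-1})\, d\bar{r}_l\big] \\
&\ge \prod_{l=1}^{L} \Big[1 - \frac{1}{2}\,\mathrm{erfc}\Big(\sqrt{\frac{n}{2}}\, z_c\Big)\Big] \\
&= \Big[1 - \frac{1}{2}\,\mathrm{erfc}\Big(\sqrt{\frac{n}{2}}\, z_c\Big)\Big]^{L},
\end{aligned} \tag{8}$$

where $\bar{r}_0 = r_0$ and Eq. 6 is used in the third line. To guarantee a certain level of reliability, $P_c$, at the
end of the chain, the length of the chain, L, must be restricted by Eq. 8, i.e., the length of
the chain is at most

$$\begin{aligned}
L &\le \frac{\log P_c}{\log\big[1 - \frac{1}{2}\,\mathrm{erfc}\big(\sqrt{n/2}\, z_c\big)\big]} \\
&\approx \frac{-2\log P_c}{\mathrm{erfc}\big(\sqrt{n/2}\, z_c\big)} \\
&\approx C \sqrt{n}\, e^{n z_c^2/2},
\end{aligned} \tag{9}$$

where $C = -\sqrt{2\pi}\, z_c \log P_c$ and, in the last two lines, higher order terms are neglected
assuming a large n. Thus, the number of layers where activity can reliably propagate increases as
the number of neurons in each layer increases. There is a constraint, on the other hand, on
the total number of neurons in the network, i.e.,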



$$N = nL = C n^{3/2} e^{n z_c^2/2}, \tag{10}$$

where the second equality follows assuming Eq. 9. For large N, this equation yields, to the
leading order, $n \approx (2/z_c^2)\log N$. Therefore, a memory lifetime of

$$L = \frac{N}{n} \approx \frac{z_c^2}{2}\,\frac{N}{\log N} \tag{11}$$

can be achieved with $n \approx (2/z_c^2)\log N$ neurons per layer. Because of the symmetry of the
system, we can repeat a similar argument for $r_0 < -r_c$ and find that the scaling is the same.
The proportionality factor $z_c^2$ in Eq. 11 describes the signal-to-noise ratio. This factor takes
a small value if $r_c$ is small compared with the noise, showing that there is a minimum input
intensity for the network to store sequential memory reliably. Although Eq. 11 is a lower
bound for the memory lifetime as the inequality in Eq. 8 is not necessarily tight, we can
expect that the derived scaling behavior with N is correct. This is because we can also upper
bound the probability of Eq. 5 using the same functional form as Eq. 6 but with another
constant factor greater than $z_c$. The derived scaling of the memory lifetime of order $N/\log N$
is much better than the previously suggested (Ganguli et al., 2008) scaling of order $\sqrt{N}$ for
large N.
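As a rough numerical companion to Eqs. 8 to 11, the sketch below evaluates the first line of Eq. 9 exactly and searches for the smallest layer size n for which a chain of L = N/n layers still satisfies the bound. The value $z_c = 1$, the reliability $P_c = 0.9$, and the function names are hypothetical choices made only for illustration.

```python
import math

def max_layers(n, z_c, p_c=0.9):
    """First line of Eq. 9: largest L allowed by the lower bound of Eq. 8."""
    per_layer_failure = 0.5 * math.erfc(math.sqrt(n / 2.0) * z_c)
    if per_layer_failure == 0.0:          # erfc underflow: the bound is effectively infinite
        return math.inf
    return math.log(p_c) / math.log1p(-per_layer_failure)

def guaranteed_lifetime(N, z_c, p_c=0.9):
    """Longest chain L = N // n (with its n) that the bound certifies for N neurons."""
    for n in range(1, N + 1):             # max_layers grows with n while N / n shrinks
        if N / n <= max_layers(n, z_c, p_c):
            return N // n, n
    raise ValueError("the bound cannot be met with N neurons")

z_c = 1.0                                  # hypothetical signal-to-noise constant
for N in (10**4, 10**5, 10**6):
    L, n = guaranteed_lifetime(N, z_c)
    leading_order_n = (2.0 / z_c**2) * math.log(N)   # leading-order estimate below Eq. 10
    print(N, n, round(leading_order_n, 1), L)
```

The layer size found this way grows slowly with N, so the certified chain length grows almost, but not quite, linearly with N. The leading-order estimate of n neglects the $n^{3/2}$ prefactor in Eq. 10 and is therefore only expected to match in order of magnitude at these sizes.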

Although only a limited amount of information (at most 1 bit) can be transmitted by
the above network, it is easy to increase the amount of information through the parallel use
of k chains. Provided there is independent input to each chain, the information
transmitted through the parallel chains becomes k times larger than that through a single chain.
While this solution requires k times more neurons than a single chain, this does not alter
the scaling of the memory lifetime with N. Therefore, the memory lifetime for reliably
reconstructing sequences of ~ k bits of information in each time step is ~ $(N/k)/\log(N/k)$.
More quantitatively, based on the assumption that the input to each chain independently
takes a positive or negative value of the same magnitude with equal probability, the mutual
information between the input and the average activity in the final layer is, by symmetry,

$$I(r_0; \bar{r}_L) = k\,(1 - H_2(P_c)) \tag{12}$$

in bits, where the noise entropy, $H_2(P_c) = -P_c\log_2 P_c - (1 - P_c)\log_2(1 - P_c)$, is about 0.5
bits if $P_c = 0.9$. This means that the nonlinear feedforward chains can sustain I bits of
information about the input for a duration proportional to $(N/I)/\log(N/I)$.
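Eq. 12 is straightforward to evaluate. The short sketch below computes the noise entropy and the information carried by k parallel chains; the function names and the example of k = 10 chains are illustrative assumptions.

```python
import math

def binary_entropy(p):
    """H_2(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def transmitted_bits(k, p_c):
    """Eq. 12: information carried by k independent chains with reliability p_c."""
    return k * (1.0 - binary_entropy(p_c))

print(round(binary_entropy(0.9), 3))        # noise entropy at P_c = 0.9
print(round(transmitted_bits(10, 0.9), 2))  # hypothetical example with k = 10 chains
```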

2.2 Numerical verification of the nearly extensive memory lifetime 

As an example, let us consider the sign nonlinearity, $\phi(x) = \mathrm{sgn}(x)$. The corresponding
conditional mean is given by $\mu(\bar{r}) = \mathrm{erf}\big(\bar{r}/(\sqrt{2}\sigma)\big)$. With these binary neurons, where each
neuron takes either an active (r = +1) or inactive (r = -1) state, it is easy to numerically
evaluate the memory lifetime because the average activity $\bar{r}$ in each layer can only take n + 1
discrete values. For example, when m (m = 0, 1, ..., n) neurons in layer $l + 1$ are active and
n - m neurons are inactive, the average activity in this layer is

$$\bar{r}_{l+1} = \frac{m}{n} - \frac{n - m}{n} = \frac{2m - n}{n}. \tag{13}$$

Hence, the conditional probability distribution is given in terms of a binomial distribution
by

$$P\Big(\bar{r}_{l+1} = \frac{2m - n}{n} \,\Big|\, \bar{r}_l\Big) = \binom{n}{m}\Big(\frac{1 + \mu(\bar{r}_l)}{2}\Big)^{m}\Big(\frac{1 - \mu(\bar{r}_l)}{2}\Big)^{n-m}, \tag{14}$$

where $(1 + \mu(\bar{r}_l))/2$ and $(1 - \mu(\bar{r}_l))/2$ are the respective probabilities that a neuron in layer
$l + 1$ will take an active or inactive state. The conditional distribution of Eq. 14 over
all possible input/output states can be described as an n + 1 by n + 1 square matrix. In
particular, the distribution of average activity in the final layer, $P(\bar{r}_L|r_0)$, can be computed
by evaluating the Lth power of this matrix. We assume a decoder that estimates the sign of
$r_0$ based on the sign of $\bar{r}_L$. This means that the performance is good if $P(\bar{r}_L > 0|r_0) \approx 1$ for
positive $r_0$; and this condition also assures $P(\bar{r}_L < 0|r_0) \approx 1$ for negative $r_0$ by the symmetry.
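The matrix procedure described above can be sketched as follows. The code builds the (n + 1) by (n + 1) matrix of Eq. 14 for the sign nonlinearity and propagates the distribution of the number of active neurons layer by layer. Treating the first layer as driven directly by $r_0$, counting ties ($\bar{r}_L = 0$) as decoding failures, and capping the search at a maximum number of layers are assumptions of this sketch, as are all function names.

```python
import numpy as np
from math import comb, erf, sqrt

def p_active(r_bar, sigma):
    """(1 + mu(r_bar)) / 2 with mu(r) = erf(r / (sqrt(2) sigma)) for phi = sgn."""
    return 0.5 * (1.0 + erf(r_bar / (sqrt(2.0) * sigma)))

def binomial_pmf(n, p):
    """Probability of m active neurons out of n, for m = 0..n."""
    m = np.arange(n + 1)
    coeff = np.array([comb(n, int(j)) for j in m], dtype=float)
    return coeff * p ** m * (1.0 - p) ** (n - m)

def transfer_matrix(n, sigma):
    """Column m holds Eq. 14 for a previous-layer average of (2m - n) / n."""
    cols = [binomial_pmf(n, p_active((2 * m - n) / n, sigma)) for m in range(n + 1)]
    return np.stack(cols, axis=1)

def prob_positive(r0, n, n_layers, sigma):
    """P(average activity of layer n_layers > 0 | input r0)."""
    dist = binomial_pmf(n, p_active(r0, sigma))   # layer 1 is driven by r0
    T = transfer_matrix(n, sigma)
    for _ in range(n_layers - 1):
        dist = T @ dist
    return float(dist[2 * np.arange(n + 1) > n].sum())

def lifetime(r0, n, sigma, criterion=0.9, max_layers=100_000):
    """Number of layers for which P(r_bar_L > 0 | r0) stays at or above the criterion."""
    dist = binomial_pmf(n, p_active(r0, sigma))
    T = transfer_matrix(n, sigma)
    positive = 2 * np.arange(n + 1) > n
    for layer in range(1, max_layers + 1):
        if dist[positive].sum() < criterion:
            return layer - 1
        dist = T @ dist
    return max_layers                              # criterion still met at the cap

for n in (10, 20):
    print("n =", n, "lifetime at sigma = 0.6:", lifetime(1.0, n, 0.6))
```

Scanning n for a fixed total size N = nL and keeping the longest lifetime would give the kind of optimization over layer size that the numerical evaluation calls for.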




Figure 3: The memory lifetime of binary neurons scales nearly in proportion to the network
size. The memory lifetime, L, was evaluated at two noise levels: $\sigma = 0.4$ (blue solid) and $\sigma = 0.6$
(red solid), and two inputs: $r_0 = 1.0$ (Left) and $r_0 = 0.3$ (Right). The offset of two curves
at different noise levels reflects the different number of neurons in each layer, n, chosen to
achieve the 90% decoding criterion. The scaling behavior was well fitted by ~ $N/\log N$ in
all cases as suggested by the theoretical result.

Figure 3 plots the number of layers L beyond which the probability $P(\bar{r}_L > 0|r_0)$ falls to less
than 90% at two different noise levels, $\sigma = 0.4$ and 0.6. The number of neurons in each layer,
n, was chosen to maximize the memory lifetime under a constraint that the total number of
neurons is N. We used two inputs, $r_0 = 1.0$ and $r_0 = 0.3$, in the simulation, but the scaling
of the memory lifetime was not sensitive to the input $r_0$. We can see from Fig. 3 that the
memory lifetime is asymptotically proportional to $N/\log N$ as predicted by the theory. Note
that if the input is too small compared to the noise level ($r_0 \ll \sigma$), the asymptotic behavior
is apparent only at very large N because a large number of neurons ($\sim \sigma^2/r_0^2$) are required
to achieve the 90% reliability criterion even in the first layer.



2.3 Nonlinear dynamics provides robust working memory. 

We saw in the previous section that a nearly extensive memory lifetime can be achieved by
utilizing the error-correcting property of nonlinear neural dynamics. In this section, I will
show that the sequential memory in this regime is much more robust to network parameters
than the previously proposed solution in the semi-linear regime (Ganguli et al., 2008).

To better compare the sequential memory in a nonlinear vs. semi-linear regime, let us
clarify our goal. The goal is to maximize the length of the feedforward chain(s), while
maintaining I bits of mutual information about the input until the end of the chain(s). In
this section we consider that the input, $r_0$, is randomly drawn from a Gaussian distribution
of mean zero and variance $\sigma_0^2$. Because the information about the input can only degrade
as activity propagates down the layers, it suffices to constrain the information in the final
layer, i.e., the mutual information between the input and the average activity in the final
layer must satisfy

$$I(r_0; \bar{r}_L) \ge I. \tag{15}$$



Let us now review the sequential memory in the semi-linear regime (Ganguli et al., 2008).
The feedforward network structure in this setting is similar to the one used in this paper
except that the number of neurons in each layer, $n_l$ for layer $l$, can vary. The total number
of neurons is given by $N = \sum_{l=1}^{L} n_l$. The derivation of Eqs. 3 and 4 is analogous with the
variable number of neurons in each layer. When the activity is small, the conditional mean
of Eq. 4 is well approximated by a linear function with slope (gain) $\mu'(0)$. As previously
explained, if the gain is smaller than 1, the signal tends to decay toward zero, and if the
gain is larger than 1, the signal tends to grow until nonlinearity eventually kicks in (c.f. Fig.
2). The optimal semi-linear solution is achieved by setting the gain equal to 1 so that the
signal neither decays nor grows on average, and memory only degrades due to fluctuation
of the activity. In this case, the memory lifetime can scale with $\sqrt{N}$ (Ganguli et al., 2008).
To implement this semi-linear solution, however, some fine-tuning is required. For example,
Ganguli et al. simply used $\phi(x) = \tanh(x)$ without explicitly considering the effect of noise
that reduces the gain (Herrmann et al., 1995). Because the slope of $\phi(x) = \tanh(x)$ is always
less than 1 except at the origin, the gain must be strictly less than 1 in the presence of noise
($\mu'(0) = \int D\xi\, \phi'(\sigma\xi) < 1$). This means that the signal must decay at each time step by the
factor $\mu'(0) < 1$, and because the gain is independent of the number of neurons, the memory
lifetime is indeed order 1 rather than $\sqrt{N}$. To implement the memory lifetime of ~ $\sqrt{N}$,
one needs to fine-tune parameters such as the slope $\beta$ of nonlinearity $\phi(x) = \tanh(\beta x)$. The
solution of $\beta$ that yields $\mu'(0) = 1$ deviates from $\beta = 1$ as the noise level increases, and
this difference becomes apparent at large N. One should also note that the fine-tuning of
the slope, $\beta$, only provides a linear relation locally at around zero activity ($\bar{r} \approx 0$). If the
activity is large in magnitude, the signal tends to decay due to nonlinear effects.
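The fine-tuning discussed above amounts to solving $\mu'(0) = 1$ for the slope $\beta$ at a given noise level. A minimal sketch, assuming $\phi(x) = \tanh(\beta x)$, Gauss-Hermite quadrature for the Gaussian integral, and a bisection search on a hypothetical bracket $\beta \in [1, 10]$:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature nodes and weights, normalized to the Gaussian measure D(xi).
_XI, _W = hermegauss(80)
_W = _W / np.sqrt(2.0 * np.pi)

def gain(beta, sigma):
    """mu'(0) = beta * integral of sech^2(beta sigma xi) D(xi) for phi(x) = tanh(beta x)."""
    return float(beta * np.sum(_W / np.cosh(beta * sigma * _XI) ** 2))

def tuned_slope(sigma, lo=1.0, hi=10.0, tol=1e-10):
    """Bisection for the beta with unit gain; requires gain(lo) < 1 < gain(hi)."""
    if not (gain(lo, sigma) < 1.0 < gain(hi, sigma)):
        raise ValueError("no sign change in the bracket; try another range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gain(mid, sigma) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for sigma in (0.1, 0.2, 0.3):
    print("sigma =", sigma,
          "gain at beta = 1:", round(gain(1.0, sigma), 4),
          "tuned beta:", round(tuned_slope(sigma), 4))
```

In this sketch the gain at $\beta = 1$ falls below 1 as soon as noise is present, and the tuned slope moves further above 1 as $\sigma$ grows, which is the deviation from $\beta = 1$ described in the text.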

Suppose we fine-tuned the gain to 1 and assumed the linear input-output relation (which
is not true if the signal and/or noise is large). It is then easy to estimate the mutual
information between input and output because $P(\bar{r}_L|r_0)$ is approximately Gaussian. The
signal-to-noise ratio in the final layer is described by $\sigma_0^2/\big(\sum_{l=1}^{L} \sigma^2/n_l\big)$, where the signal is
preserved by setting the gain equal to 1, and the noise of variance $\sigma^2/n_l$ is added in each
layer. Hence, under this semi-linear scheme, the mutual information is

$$\begin{aligned}
I(r_0; \bar{r}_L) &= \frac{1}{2}\log_2\Big[1 + \sigma_0^2\Big(\sum_{l=1}^{L}\frac{\sigma^2}{n_l}\Big)^{-1}\Big] \\
&\le \frac{1}{2}\log_2\Big[1 + \frac{\sigma_0^2}{\sigma^2 L^2}\sum_{l=1}^{L} n_l\Big] \\
&= \frac{1}{2}\log_2\Big[1 + \frac{\sigma_0^2 N}{\sigma^2 L^2}\Big],
\end{aligned} \tag{16}$$

where the second line follows due to the convexity of the 1/x function, and the equality
holds if and only if the number of neurons in each layer is uniform ($n_l = N/L$ for all $l$) (Lim
and Goldman, 2011). Hence, the best achievable memory lifetime that guarantees I bits of
information about input under this semi-linear scheme is



$$L = \min\Big(\frac{\sigma_0}{\sigma}\sqrt{\frac{N}{2^{2I} - 1}},\; N\Big), \tag{17}$$

where min takes the minimum argument. This means that unless the number of neurons
is small, i.e., $N < (\sigma_0^2/\sigma^2)/(2^{2I} - 1)$, the memory lifetime of the semi-linear network is
proportional to $\sqrt{N}$ and exponentially decreases with I, suggesting that it is difficult to
maintain precise information about input in this setting. If a large amount of information
is required, however, we can apply the parallel scheme used in Sec. 2.1 to the semi-linear
memory by dividing N neurons into k parallel chains, where each chain consists of N/k neurons.
Provided there is independent input to each chain, a single chain only needs to retain I/k
bits of information in the final layer. Hence, the memory lifetime of the semi-linear parallel
chains becomes

$$L = \min\Big(\frac{\sigma_0}{\sigma}\sqrt{\frac{N/k}{2^{2I/k} - 1}},\; \frac{N}{k}\Big). \tag{18}$$

In Eq. 18, the first and second arguments of min(·) are increasing and decreasing functions
of k, respectively. Hence, the best memory lifetime for large N is given by



$$L \approx \frac{\sigma_0}{\sigma}\sqrt{\frac{N}{(2\log 2)\, I}} \tag{19}$$

at $k \approx (\sigma/\sigma_0)\sqrt{(2\log 2)\, N I}$, where each chain needs to retain much less information than in
the single-chain scheme. Although the use of parallel chains does not alter the asymptotic
scaling of the memory lifetime with N, it is also beneficial in the semi-linear regime to buffer a large
amount of information. Comparing the memory lifetime in the two regimes, ~ $(N/I)/\log(N/I)$
in the nonlinear regime and ~ $\sqrt{N/I}$ in the semi-linear regime, we can conclude that the
memory lifetime to retain the same amount of information increases asymptotically faster
with N in the nonlinear regime than in the semi-linear regime.

Figure 4: The nonlinear network outperformed the semi-linear network for large network
sizes or with large noise. A Gaussian random input with zero mean and standard
deviation $\sigma_0 = 0.5$ was provided to the first layer, and the performance was measured by the
mutual information of the input and the activity in the final layer. The sequential
memory performance was numerically examined at the noise level $\sigma$ shown in each panel and with
two nonlinear functions: (A) the hyperbolic tangent nonlinearity $\phi(x) = \tanh(\beta x)$ and (B)
a piecewise-linear function $\phi(x) = \beta x$ for $|x| < 1/\beta$ with hard saturating bounds (LHB).
Three types of networks were compared with approximately the same size, N, and the same
number of layers $L \approx \sqrt{2N}$ for a fair comparison: the fan-out network (Ganguli et al., 2008)
with $\beta = 1$ and a linearly increasing number of neurons along deeper layers ($n_l = l$; order 1
memory lifetime); the semi-linear network with a $\beta$ solution that yielded a gain equal to 1
and a fixed number of neurons in each layer ($n_l = N/L$; order $\sqrt{N}$ memory lifetime); and the
nonlinear network with the same network architecture as the semi-linear network but with
$\beta = 2$ ($n_l = N/L$; order $N/\log N$ memory lifetime). The semi-linear network always showed
better performance than the fan-out network and the nonlinear network was superior to the
other two except at a small network size and with a small amount of noise.
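The semi-linear lifetimes of Eqs. 17 to 19 can be compared numerically. The sketch below evaluates Eq. 17 for a single chain, maximizes Eq. 18 over the number of parallel chains k by brute force, and compares the result with the large-N expression of Eq. 19. The parameter values ($N = 10^6$, I = 10 bits, $\sigma_0 = 0.5$, $\sigma = 0.1$) and the function names are illustrative assumptions.

```python
import numpy as np

def single_chain_lifetime(N, info_bits, sigma0, sigma):
    """Eq. 17: one semi-linear chain retaining info_bits bits."""
    snr_limited = (sigma0 / sigma) * np.sqrt(N / (2.0 ** (2.0 * info_bits) - 1.0))
    return float(min(snr_limited, N))

def parallel_chain_lifetime(N, info_bits, sigma0, sigma):
    """Eq. 18 maximized over the number of chains k = 1..N (brute force)."""
    k = np.arange(1, N + 1, dtype=float)
    # expm1 keeps 2^(2I/k) - 1 accurate when 2I/k is small
    denom = np.expm1(2.0 * np.log(2.0) * info_bits / k)
    first = (sigma0 / sigma) * np.sqrt((N / k) / denom)
    second = N / k
    lifetime = np.minimum(first, second)
    best = int(np.argmax(lifetime))
    return float(lifetime[best]), int(k[best])

def asymptotic_lifetime(N, info_bits, sigma0, sigma):
    """Eq. 19: large-N limit of the optimized parallel scheme."""
    return float((sigma0 / sigma) * np.sqrt(N / (2.0 * np.log(2.0) * info_bits)))

N, info_bits, sigma0, sigma = 10**6, 10, 0.5, 0.1
best_L, best_k = parallel_chain_lifetime(N, info_bits, sigma0, sigma)
print("single chain (Eq. 17):", single_chain_lifetime(N, info_bits, sigma0, sigma))
print("best parallel (Eq. 18):", best_L, "with k =", best_k)
print("asymptotic (Eq. 19):", asymptotic_lifetime(N, info_bits, sigma0, sigma))
```

For these values the single chain is limited to a handful of layers by the exponential dependence on I, while the optimized parallel scheme should come close to the $\sqrt{N/I}$ expression of Eq. 19.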
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In practice, the network size A^ is always finite. Hence, whether we see the differences in 
the order 1, viV, and N/logN memory hfetimes depends on the network size. I therefore 
investigated three different networks with about the same number of total neurons A^ and 
the same number of layers L: the fan-out network ([Ganguli et al. 2008) with a linearly 



increasing number of neurons along layers {ni = I) and /3 = 1; the semi-linear network 
with a fixed number of neurons in each layer and a /3 solution that yielded a gain equal to 
1; and the nonlinear network, which had the same network architecture as the semi-linear 
network, but with (3 = 2 (and a gain greater than 1). For a fair comparison, the number 
of neurons per layer was adjusted for the semi-linear and nonlinear networks so that the 
total number of neurons was approximately the same as that of the fan-out network, i.e., 
N = L{L + l)/2. Figure IlK. shows the mutual information between the Gaussian input and 
the activity in the final layer for the three networks introduced above for various network 
sizes with (f){x) = tanh(/3a;). When noise was small [a = 0.1), the performance of the semi- 
linear and fan-out networks was superior to that of the non-linear network with a small 
network size. This was because the nonlinear network squashed the analog inputs to almost 
binary values, reducing information down to about one bit, but the fan-out and semi-linear 
networks were able to retain more than one bit of information at a small network size and 
with low noise. As expected, the semi-linear network preserved more information than the 
fan-out network because of the fine-tuning of /3 and the optimal network architecture. When 
the network size was larger than 500, the nonlinear network preserved more information 
than the other networks. At a slightly higher noise level {a = 0.2), the nonlinear network 
always outperformed the other two in the range of network sizes examined. Note that the 
mutual information of the nonlinear network increased with the network size here because the 
number of layers was matched to the fan-out network. This means that the nonlinear chain 
can be significantly longer than the fan-out and semi-linear chains to achieve a comparable 
level of mutual information. Figure 4B shows results analogous to Fig. 4A but using for
$\phi$ a linear function with hard saturating bounds (LHB), i.e., $\phi(x) = \mathrm{sgn}(x) \min(|\beta x|, 1)$.
The results were qualitatively similar to Fig. 4A, but the fan-out and semi-linear networks
performed better in this figure because LHB retained linearity for a larger range of input
than the hyperbolic tangent function. In particular, compared at the same noise levels, the
crossover point of the semi-linear and nonlinear networks with LHB nonlinearity lay at a
larger $N$ than with the hyperbolic tangent nonlinearity.¹ Because the nonlinear functions
used in Figs. 4A and 4B share the same slope at the origin, the result shows that not only $N$
and $\sigma$, but also the nonlinearity affects the crossover point of the semi-linear and nonlinear
networks.
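
As a minimal sketch of the two nonlinearities compared here (the function names and the value of the gain are illustrative assumptions, not taken from the paper), the following Python snippet checks that both functions share the same slope at the origin and that the LHB function stays exactly linear where the hyperbolic tangent has already started to bend:

```python
import numpy as np

def phi_tanh(x, beta):
    """Hyperbolic tangent nonlinearity, phi(x) = tanh(beta * x)."""
    return np.tanh(beta * np.asarray(x, dtype=float))

def phi_lhb(x, beta):
    """Linear function with hard saturating bounds, sgn(x) * min(|beta * x|, 1)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.minimum(np.abs(beta * x), 1.0)

def slope_at_origin(phi, beta, eps=1e-6):
    """Central finite-difference estimate of the slope of phi at x = 0."""
    return float((phi(eps, beta) - phi(-eps, beta)) / (2.0 * eps))

beta = 2.0           # illustrative gain (assumption, not a value from the paper)
x_half = 0.5 / beta  # input for which beta * x = 0.5

# Both slopes should be close to beta.
print(slope_at_origin(phi_tanh, beta), slope_at_origin(phi_lhb, beta))
# LHB is still exactly linear here, whereas tanh is already compressed.
print(float(phi_tanh(x_half, beta)), float(phi_lhb(x_half, beta)))
```

The comparison at a single input value is only meant to illustrate why LHB retains linearity over a larger input range; it does not reproduce the mutual-information results.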

We should also note that when more biologically realistic neuron models are used, it is even more
difficult for the semi-linear model to set the gain equal to 1 because the nonlinearity is not
fixed but changes with the dynamical input properties in those models. Hence, the semi-linear
memory requires some elaborate additional mechanism to achieve full performance.



¹ Although the semi-linear network showed better performance than the nonlinear network for the entire
range of $N$ examined in Fig. 4B with $\sigma = 0.1$, the difference in the scaling of the memory lifetime ensures a
crossover at a larger network size.

Another important and potentially testable difference between the semi-linear and nonlinear
memory is how the memory lifetime, $L$, scales with the variance, $v$, of the network activity,
i.e., $L \sim v^{\gamma}$ with some exponent $\gamma$. While the semi-linear memory provides $\gamma = 1/2$ from
Eq. 18, the nonlinear memory provides $\gamma = 1$ from Eq. 11.
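
One way such a scaling exponent could be estimated from measurements is a straight-line fit on log-log axes. The sketch below is illustrative only: the function name `fit_scaling_exponent`, the prefactor, and the synthetic lifetime values are assumptions made for the example, not data or code from this paper.

```python
import numpy as np

def fit_scaling_exponent(variance, lifetime):
    """Least-squares slope of log(lifetime) against log(variance)."""
    log_v = np.log(np.asarray(variance, dtype=float))
    log_l = np.log(np.asarray(lifetime, dtype=float))
    slope, _intercept = np.polyfit(log_v, log_l, 1)
    return float(slope)

# Synthetic, noise-free lifetimes generated from exact power laws (hypothetical).
v = np.array([0.1, 0.2, 0.4, 0.8, 1.6])
lifetime_semi_linear = 30.0 * np.sqrt(v)  # exponent 1/2 by construction
lifetime_nonlinear = 30.0 * v             # exponent 1 by construction

gamma_semi = fit_scaling_exponent(v, lifetime_semi_linear)
gamma_nonlinear = fit_scaling_exponent(v, lifetime_nonlinear)
print(gamma_semi, gamma_nonlinear)
```

With real recordings the fit would of course be noisy, and the range of variances covered would determine how well the two exponents can be told apart.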



2.4 Synfire chains can reliably buffer complex spike sequences of 
leaky integrate-and-fire neurons. 

The abstract firing rate model studied in the previous sections was suitable for mathematical 
analyses but was less biologically realistic. However, all the main properties explored in the 
previous sections should hold even with more realistic models. The key properties that 
yielded the nearly extensive sequential memory lifetime were the feedforward propagation of 
activity (that prevents stimuli presented at different timings from mixing) and the attracting 
dynamics (that implements error-correction). To illustrate this point, a feedforward network
of leaky integrate-and-fire (LIF) neurons was explored. Detailed parameter studies were,
however, beyond the scope of this paper.



A network of $N = 4000$ current-based LIF neurons was simulated (e.g., Dayan and Abbott, 2001;
Vogels and Abbott, 2005). The network consists of $k = 10$ independent
synfire chains, where each chain has $L = 20$ layers and $n = 20$ neurons in each layer. The
membrane dynamics of neuron $i$ ($i = 1, 2, \ldots, N$) is described by

$$\tau \frac{dV_i(t)}{dt} = -V_i(t) + E_L + x_i(t) + \mu_{bg} + \sigma_{bg}\,\xi_i(t), \qquad (20)$$

where $V_i$ is the membrane potential of neuron $i$, $\tau = 10$ ms is the membrane time constant,
$E_L = -70$ mV is the resting potential, $x_i$ is the input to neuron $i$ from other neurons in a
local network, $\mu_{bg} = 7$ mV is the mean background input level, $\xi_i$ is white Gaussian noise
of unit variance, and $\sigma_{bg} = 5$ mV describes the magnitude of background input fluctuation.
When the membrane potential reaches the threshold value of $V_{th} = -50$ mV, the neuron
emits a spike and the membrane potential is reset to $V_{reset} = -70$ mV. After the spike, the
membrane potential is fixed at $V_{reset}$ for the duration of the refractory period, $\tau_{ref} = 2$ ms.
Input to neuron $i$ is calculated according to

$$\tau_x \frac{dx_i(t)}{dt} = -x_i(t) + \sum_{j=1}^{N} \sum_{f} W_{ij}\,\delta(t - t_j^{(f)}), \qquad (21)$$

where $\tau_x = 5$ ms, $W_{ij}$ is the synaptic strength from neuron $j$ to $i$, $\delta(t)$ is the Dirac delta
function, and $t_j^{(f)}$ is the $f$th spike time of neuron $j$. This means that when neuron $i$ receives
a spike from neuron $j$ at $t_j^{(f)}$, its membrane potential is depolarized by
$W_{ij}[e^{-(t - t_j^{(f)})/\tau} - e^{-(t - t_j^{(f)})/\tau_x}]/(\tau - \tau_x)$ for $t > t_j^{(f)}$. We measure the synaptic strength $W_{ij}$ using the peak
amplitude of the excitatory postsynaptic potential, which is about $W_{ij} \times 50$ s$^{-1}$ for the current
set of parameters. The synaptic strength, $W_{ij}$, takes a uniform non-zero value, $w$, if the layer
of neuron $i$ is next to the layer of neuron $j$, and takes zero otherwise. The feedforward
synaptic strength between adjacent layers is set uniformly to 1.0 mV except in Fig. 5C,
where $w$ is varied as a control parameter. Each chain is independently stimulated by 10-Hz
random Poisson pulses, upon which all the neurons in the first layer are depolarized by 10
mV according to the time course of excitatory synaptic input.
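
The quoted peak factor of about 50 s$^{-1}$ can be checked directly from the two time constants. The sketch below evaluates the postsynaptic potential per unit synaptic strength and locates its maximum analytically; the names `psp_kernel`, `t_peak`, and `w_for_1mv` are introduced here for illustration and are not from the paper.

```python
import numpy as np

TAU = 0.010    # membrane time constant (s)
TAU_X = 0.005  # synaptic time constant (s)

def psp_kernel(t, tau=TAU, tau_x=TAU_X):
    """Membrane response per unit W (in 1/s) at time t (s) after a presynaptic spike."""
    t = np.maximum(np.asarray(t, dtype=float), 0.0)  # response is zero before the spike
    return (np.exp(-t / tau) - np.exp(-t / tau_x)) / (tau - tau_x)

# Setting the time derivative of the kernel to zero gives the time of the peak.
t_peak = np.log(TAU / TAU_X) * TAU * TAU_X / (TAU - TAU_X)
peak_factor = float(psp_kernel(t_peak))  # peak depolarization per unit W, in 1/s

# Synaptic strength W (in mV * s) that yields a 1 mV peak depolarization.
w_for_1mv = 1.0 / peak_factor

# Cross-check the analytical peak against a fine time grid.
grid = np.linspace(0.0, 0.1, 100001)
grid_max = float(psp_kernel(grid).max())

print(t_peak, peak_factor, w_for_1mv, grid_max)
```

For $\tau = 10$ ms and $\tau_x = 5$ ms the peak occurs at $\tau \ln 2 \approx 6.9$ ms, where the two exponentials equal 1/2 and 1/4, so the peak factor is $0.25 / (5\ \text{ms}) = 50$ s$^{-1}$.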
Figure 5: A feedforward network of leaky integrate-and-fire neurons reliably buffered spike
patterns. (A) Top: Membrane potential of a single neuron. Middle: Spike-timing of all 
neurons in the network. The network consisted of $k = 10$ independent synfire chains, where
each synfire chain had L = 20 layers and n = 20 neurons in each layer. The oblique patterns 
describe feedforward propagation of synfire activity. Bottom: The population firing rate of 
all the neurons averaged in 10-ms bins. (B) The spiking pattern of the first layers (blue 
dots) was well preserved even until the final 20th layers (red circles). The spike pattern of 
the final layers was shifted so that the spike overlap with the first layers was maximized. 
The feedforward synaptic strengths were set to 1 mV. Note that input pulses were somewhat 
degraded by background noise even in the first layer. (C) The spike overlap with the input 
pulses in the 1st, 10th, and 20th layers plotted for different feedforward synaptic strengths. 

Figure 5A shows the model behavior. The top panel shows the membrane potential of a
single neuron. The middle panel shows the spiking pattern of all neurons, where neurons were
indexed first within the same layer of the same chain, then across chains, and finally across
layers. The oblique arrangements of spiking patterns in the middle panel demonstrate that
most of the synchronous firing patterns evoked by external input pulses successfully reached
the final layers in about 100 ms. The bottom panel shows that the overall population firing
rate of the network was kept at about 10 Hz. Figure 5B shows the spike patterns of the first
and last layers. To better show the similarity, the spike times of the last layers were temporally
shifted so that the spiking activity in these two layers could be viewed closer together. The
figure demonstrates that the precise spike pattern was well preserved even until the final
layers. To quantify the similarity of two spike trains (sums of Dirac delta functions) $S_1(t)$
and $S_2(t)$, we define the inner product $\langle S_1, S_2 \rangle = \int_0^T \int_0^T S_1(t) D(t - t') S_2(t')\,dt\,dt'$, using the
entire duration of simulations, $T$, and a smoothing kernel $D(t) = \exp(-t^2/(2\tau_D^2))/\sqrt{2\pi\tau_D^2}$
with $\tau_D = 10$ ms. The overlap of the two spike trains is then measured, using the correlation
coefficient, by $\langle S_1, S_2 \rangle / \sqrt{\langle S_1, S_1 \rangle \langle S_2, S_2 \rangle}$. The overlap of the external input pulses and the
spiking activity in different layers is compared more systematically in Fig. 5C for various
values of the feedforward synaptic strengths, w. Note that the amplitude of external input 
pulses to the first layers of the chains was always fixed at 10 mV. The key parameter for the 
signal propagation was the effective input amplitude, nw, which is the product of synaptic 
strengths and the number of neurons in each layer. When this effective coupling strength 
was too small, the activity could not be successfully propagated to the next layers, and when 
the effective coupling strength was too strong, even a spontaneous firing of a single neuron 
was sufficient to activate most of the neurons in the next layer. For the network structure 
explored here, the best overlap was achieved using about 1 mV of feedforward synaptic 
strength. The figure suggests that the fine-tuning of the synaptic strength is not critical for 
the memory lifetime because the difference between the overlaps in the 10th and 20th layers 
did not expand rapidly as mistuning from the optimal parameter value increased. 
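
The overlap measure defined above can be sketched in Python as follows. Because each spike train is a sum of Dirac delta functions, the double integral reduces to a double sum of the Gaussian kernel over all pairs of spike times. The function names and the example spike times are illustrative assumptions; the temporal shift applied to the last layers before comparison is not included.

```python
import numpy as np

def spike_inner_product(times_a, times_b, tau_d=0.010):
    """Sum of the Gaussian kernel D over all pairs of spike times (in seconds)."""
    a = np.asarray(times_a, dtype=float)[:, None]
    b = np.asarray(times_b, dtype=float)[None, :]
    diff = a - b  # matrix of all pairwise time differences
    kernel = np.exp(-diff**2 / (2.0 * tau_d**2)) / np.sqrt(2.0 * np.pi * tau_d**2)
    return float(kernel.sum())

def spike_overlap(times_a, times_b, tau_d=0.010):
    """Correlation-coefficient overlap of two spike trains; 0 if either is empty."""
    norm_a = spike_inner_product(times_a, times_a, tau_d)
    norm_b = spike_inner_product(times_b, times_b, tau_d)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return spike_inner_product(times_a, times_b, tau_d) / np.sqrt(norm_a * norm_b)

# Hypothetical spike times (s): a reference train, a 2 ms jittered copy,
# and a copy displaced by 1 s so that it no longer overlaps.
reference = np.array([0.10, 0.25, 0.40])
jittered = reference + 0.002
displaced = reference + 1.0

print(spike_overlap(reference, reference))
print(spike_overlap(reference, jittered))
print(spike_overlap(reference, displaced))
```

Identical trains give an overlap of 1, a jitter much smaller than $\tau_D$ lowers it only slightly, and trains separated by many multiples of $\tau_D$ give an overlap near 0.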



3 Discussion 



I estimated the memory lifetime achieved by coupled nonlinear neurons. In contrast to the
previously proposed semi-linear scheme that provided the order $\sqrt{N}$ memory lifetime (Ganguli
et al., 2008), I have shown that an order $N/\log N$ memory lifetime can be achieved by
appropriately using nonlinear dynamics. The derived asymptotic scaling was invariant to
the accuracy of the information buffered. The proposed nonlinear network outperformed a
previously proposed semi-linear scheme in a wide range of parameters, in particular, with
a large number of neurons and large noise. I have also demonstrated that the previously
proposed semi-linear scheme is sensitive to the noise level, i.e., a small increase in the noise
level causes monotonic decay of the average activity, turning the order $\sqrt{N}$ memory lifetime
to order 1. The nonlinear scheme proposed in this paper, on the other hand, uses large
gain to prevent the activity from decaying and to alleviate the accumulation of noise using
error-correcting nonlinear dynamics. Because the mathematical model studied here is
general, the result that a network is capable of buffering sequential input much longer than
individual elements is potentially applicable to other systems beyond neural networks, such
as gene/protein and social networks.

We considered in this paper the sequential memory task that aims to reconstruct a
whole dynamical sequence of input after some delay. Note that this task is different from
delayed matching working memory tasks (Fuster, 1973; Goldman-Rakic, 1995), where a brief
stimulus is presented only at a certain time. The major difference is that stimuli presented
at different timings can interfere with each other under the sequential memory task. This
typically happens when recurrently connected networks are used to buffer a sequence of
input (Büsing et al., 2010; Lim and Goldman, 2011). For example, in sequence
generation by Hopfield-type networks (Hopfield, 1982; Kleinfeld, 1986; Sompolinsky and
Kanter, 1986) or by winnerless competition networks (Bick and Rabinovich, 2009; Seliger
et al., 2003), the activity converges to one of the learned patterns, and the presentation of
a new pattern disrupts the current state. Hence, the delay-line structure is often considered
for a sequential memory task to prevent the interference of signals presented at different
timings (Ganguli et al., 2008). While nonlinear attracting dynamics has been utilized for
non-sequential working memory tasks (Camperi and Wang, 1998; Goldman et al., 2003;
Koulakov, 2001; Lisman et al., 1998), this study shows that its error-correcting property
also provides long-lasting memory for a sequential memory task with a feedforward network
architecture.

The feedforward network structure presented in this paper was studied in the context of
synfire chains (Abeles, 1991; Aertsen et al., 1996; Diesmann et al., 1999; Kumar et al., 2008;
Rossum et al., 2002; Vogels and Abbott, 2005), whose prominent characteristic is precise
temporal patterns of spikes. Temporally precise spiking patterns have been observed
across different brain areas and different recording conditions (Hahnloser et al., 2002; Ikegaya
et al., 2004; Ji and Wilson, 2007; Jin et al., 2007; Pastalkova et al., 2008; Takahashi et al.,
2010). There is also some experimental evidence suggesting that the synfire chain is the
underlying network architecture in the brain for generating precise temporal sequences (Long
et al., 2010). Although the effect of noise on the gain was systematically studied (Herrmann
et al., 1995), the contribution of occasional large noise that blocks or spontaneously ignites
synfire activity (Bienenstock, 1995; Tetzlaff et al., 2002) was not theoretically analyzed. In
particular, the trade-off between the length of the chain and the reliability of activity being
propagated for a fixed total number of neurons was not elucidated. I demonstrated that such
occasional large noise prevents synfire chains from achieving an extensive memory lifetime,
and the resulting $\sim N/\log N$ memory lifetime is the direct consequence of such noise.



Reservoir computing (Jaeger and Haas, 2004; Maass et al., 2002) was recently proposed
as an attractive paradigm for universal and dynamical computation. This is one candidate
network that can also perform sequential memory tasks (White et al., 2004). According to
this paradigm, dynamical input is provided to a pool of neurons, called a reservoir, which
buffers the history of the input and extracts many useful features of the input sequence.
Some linear readout units are placed on top of this reservoir and trained for a specific task,
for example for reconstructing past input, while the reservoir itself remains task-nonspecific.
One of the fundamental aspects of reservoir computing is that a reservoir buffers past input
sequences so that the readout unit can successfully combine the history of the input stimuli.
Although randomly connected networks are commonly used as the reservoir, the optimal
structure of the reservoir is not yet known (Lazar et al., 2009). The current study has shown
that a feedforward structure is suitable to buffer sequences of events. Despite its benefit
for sequential memory, making the whole network into a feedforward network is probably
not a good idea. In addition to memory, it is also important for the reservoir to map
input to a high dimensional "feature space" so that a linear readout has access to useful
features (Bertschinger and Natschläger, 2004; Büsing et al., 2010; Jaeger and Haas, 2004;
Maass et al., 2002; Sussillo and Abbott, 2009). The current study suggests instead that it
would be a promising approach to embed feedforward chains with high gain as a memory-specific
sub-network for a wide class of tasks that require a long-lasting sequential memory.
This guarantees a memory lifetime that is nearly proportional to the size of that memory-specific
sub-network. We should note that a straightforward implementation of randomly
and recurrently connected nonlinear neurons only achieved a memory lifetime of order $\log N$
(Büsing et al., 2010). In view of the fact that the best possible scaling of memory lifetime
is $\sim N$ for non-sparse input sequences (Ganguli and Sompolinsky, 2010), it is clear that the
error-correcting feedforward network studied in this paper with $\sim N/\log N$ memory lifetime
is a promising candidate for general dynamical computations requiring a recent history of
activity.
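
To give a feel for how differently these scalings grow, the short sketch below tabulates $\log N$, $\sqrt{N}$, $N/\log N$, and $N$ for a few network sizes. All prefactors are set to 1, which is an assumption made purely for illustration; the actual lifetimes depend on the noise level and the required accuracy, and the function name `lifetime_scalings` is hypothetical.

```python
import math

def lifetime_scalings(n):
    """Leading-order memory lifetime scalings for network size n (prefactors set to 1)."""
    return {
        "log N": math.log(n),            # random recurrent nonlinear network
        "sqrt(N)": math.sqrt(n),         # semi-linear scheme
        "N / log N": n / math.log(n),    # error-correcting feedforward network
        "N": float(n),                   # best possible scaling
    }

network_sizes = (100, 1000, 10000, 100000)
table = {n: lifetime_scalings(n) for n in network_sizes}

for n, row in table.items():
    cells = ", ".join(f"{name} = {value:.1f}" for name, value in row.items())
    print(f"N = {n}: {cells}")
```

Even with unit prefactors, the gap between $\sqrt{N}$ and $N/\log N$ widens steadily with network size, while $N/\log N$ stays within a logarithmic factor of the extensive bound.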